Feature: Add support for L40 FusedMoE in cutlass path #1973
Conversation
Signed-off-by: Amir Klein <[email protected]>
Summary of Changes

Hello @amirkl94, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive support for L40 GPUs (SM89 architecture) within the CUTLASS FusedMoE kernels. It resolves critical compilation issues and a runtime crash related to shared memory allocation for a specific GEMM tactic on SM89, ensuring stable and efficient operation. The changes also include updates to the Python JIT compilation infrastructure to enable proper module generation and loading for L40.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Add explicit SM89 support and adjust FP4/FP8 gating across CUTLASS heuristics, MOE GEMM dispatch, and Flashinfer JIT: new SM89 NVCC flags and JIT module generator, tightened ENABLE_FP4 guards, reordered FP8 GROUPED_GEMM tile choices for SM89, and a logging-level tweak in the autotuner.
Sequence Diagram(s)

sequenceDiagram
actor User
participant PyCore as flashinfer/fused_moe/core.py
participant JITGen as fused_moe.gen_cutlass_fused_moe_sm89_module
participant JITCore as flashinfer/jit/core.py
participant CppDisp as moe_gemm_template_dispatch.h
participant CUTLASS as cutlass_kernels
User->>PyCore: get_cutlass_fused_moe_module(backend="89")
PyCore->>JITGen: gen_cutlass_fused_moe_sm89_module(use_fast_build)
JITGen->>JITCore: request NVCC flags (sm89_nvcc_flags + BF16/FP8 [+FP8_BLOCK_SCALE?])
JITGen->>PyCore: build_and_load() -> compiled module
PyCore-->>User: return loaded module
Note over CppDisp,CUTLASS: MOE GEMM FP format dispatch
alt FP8 GROUPED_GEMM & sm == 89
CppDisp->>CUTLASS: select SM89-specific tile configs (new order)
else FP8 GROUPED_GEMM & sm >= 120
CppDisp->>CUTLASS: select SM>=120 tile configs
else
CppDisp->>CUTLASS: fallback/default tile configs
end
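To make the flow above concrete, here is a minimal Python sketch of how a caller could end up on the SM89 backend; the helper below and the commented-out call are illustrative assumptions based on this PR's walkthrough, not the actual flashinfer API.

```python
# Hypothetical sketch: derive the fused-MoE backend string from the detected
# compute capability. Only torch.cuda.get_device_capability is a real API here;
# get_cutlass_fused_moe_module is the module referenced in the walkthrough.
import torch

def pick_fused_moe_backend(device: int = 0) -> str:
    major, minor = torch.cuda.get_device_capability(device)
    return f"{major}{minor}"  # "89" on an L40 (Ada / SM89)

if torch.cuda.is_available():
    backend = pick_fused_moe_backend()
    print("fused MoE backend:", backend)
    # module = get_cutlass_fused_moe_module(backend=backend)  # "89" now routes to the SM89 JIT path
```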
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Code Review
This pull request adds support for L40 GPUs (sm_89) in the CUTLASS FusedMoE path. The changes include fixing compilation issues, removing a problematic GEMM tactic for sm_89 that was causing crashes, and adding the necessary build configurations for this architecture. The changes are logical and well-implemented. I have one suggestion to improve code clarity when constructing compiler flags.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)
42-42: Fix misleading comment about SM120 vs SM100.

The comment mentions CUTLASS_ARCH_MMA_SM100_SUPPORTED, but this function checks for CUTLASS_ARCH_MMA_SM120_SUPPORTED (line 35). Update the comment to accurately reflect the macro being checked. Apply this diff:

- return false; // CUTLASS_ARCH_MMA_SM100_SUPPORTED is set when Blackwell kernels are enabled
+ return false; // CUTLASS_ARCH_MMA_SM120_SUPPORTED is set when Blackwell kernels are enabled
🧹 Nitpick comments (2)
flashinfer/jit/fused_moe.py (1)
80-88: SM89 module generation is correctly implemented.

The function appropriately:
- Uses sm89_nvcc_flags, which excludes FP4 support for L40
- Omits Hopper-specific TMA GEMM flags (correct for Ada architecture)
- Includes conditional FP8 block scale support for CUDA ≥ 12.8
Optional: Consider iterable unpacking for cleaner syntax.
As suggested by Ruff, you could use iterable unpacking instead of concatenation:
- nvcc_flags = sm89_nvcc_flags + [
+ nvcc_flags = [
+     *sm89_nvcc_flags,
      "-DENABLE_BF16",
      "-DENABLE_FP8",
      "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
      "-DUSING_OSS_CUTLASS_MOE_GEMM",
  ]

This is a minor style improvement and can be deferred.
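For illustration, a self-contained sketch of the two equivalent constructions, with a placeholder flag list and a stubbed version check standing in for the real flashinfer definitions:

```python
# Placeholder stand-ins for flashinfer's sm89_nvcc_flags / is_cuda_version_at_least.
sm89_nvcc_flags = ["-gencode=arch=compute_89,code=sm_89"]  # assumed value, illustration only

def is_cuda_version_at_least(version: str) -> bool:
    return True  # stub

# Concatenation (current style in the PR)
flags_concat = sm89_nvcc_flags + [
    "-DENABLE_BF16",
    "-DENABLE_FP8",
    "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
    "-DUSING_OSS_CUTLASS_MOE_GEMM",
]

# Iterable unpacking (RUF005 suggestion) -- produces the same list
flags_unpack = [
    *sm89_nvcc_flags,
    "-DENABLE_BF16",
    "-DENABLE_FP8",
    "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
    "-DUSING_OSS_CUTLASS_MOE_GEMM",
]
assert flags_concat == flags_unpack
```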
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
696-709: LGTM: Correctly reorganizes SM89 dispatch to avoid shared memory issues.

The reorganized control flow properly addresses the L40 (SM89) issue by:
- Routing FP8 workloads to Sm89 kernels with runtime validation (line 703)
- Routing non-FP8 workloads to Sm80 kernels (lines 707-708)
This aligns with the kernel implementation in moe_cutlass_kernel.h, which shows the SM89 architecture reusing Sm80 kernels for non-FP8 types, and prevents the "GPU lacks the shared memory resources" assertion mentioned in the PR objectives.

Optional suggestion: Consider adding a brief comment explaining why non-FP8 on SM89 uses the Sm80 path, to help future maintainers understand the shared memory constraint that motivated this design.
Apply this diff to add a clarifying comment:
  } else {
+   // Non-FP8 workloads on SM89 (L40) reuse Sm80 kernels to avoid
+   // Sm89-specific tactics that exceed shared memory limits
    dispatchMoeGemmToCutlass<T, WeightType, ScaleBiasType, cutlass::arch::Sm80, EpilogueTag>(
        inputs, multi_processor_count_);
  }
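To summarize the SM 80-89 routing discussed in this review, here is a schematic Python rendering of the decision; the function name, dtype tags, and return strings are illustrative only and do not exist in the codebase.

```python
# Schematic of the SM 80-89 dispatch decision (Python rendering of the C++
# control flow in moe_gemm_template_dispatch.h; not a real API).
def pick_cutlass_arch(sm: int, dtype: str) -> str:
    assert 80 <= sm < 90
    if dtype in ("fp8", "w4afp8"):
        if sm != 89:
            raise RuntimeError("For sm >= 80 and < 90, fp8 is only supported with sm == 89")
        return "Sm89"   # Ada-specific FP8 grouped-GEMM kernels
    if dtype == "fp4":
        raise RuntimeError("FP4 MoE GEMM requires sm >= 90")
    return "Sm80"       # non-FP8 on SM89 reuses the Ampere kernels
```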
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
- flashinfer/fused_moe/core.py (2 hunks)
- flashinfer/jit/core.py (1 hunks)
- flashinfer/jit/fused_moe.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
flashinfer/fused_moe/core.py (2)
- flashinfer/jit/fused_moe.py (1): gen_cutlass_fused_moe_sm89_module (80-87)
- flashinfer/jit/core.py (1): build_and_load (272-284)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (2)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (6): __nv_fp8_e5m2 (91-93), cutlass (114-116), cutlass (120-122), cutlass (127-129), cutlass (132-134), cutlass (140-142)
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h (1): cutlass (40-677)

flashinfer/jit/fused_moe.py (2)
- flashinfer/jit/core.py (2): JitSpec (185-284), gen_jit_spec (287-353)
- flashinfer/jit/cpp_ext.py (1): is_cuda_version_at_least (86-87)
🪛 Ruff (0.14.1)
flashinfer/jit/fused_moe.py
81-86: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
🔇 Additional comments (6)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

161-168: LGTM! Targeted fix for L40 shared memory constraints.

The separation of SM89 from SM >= 120 handling correctly removes CtaShape16x256x128_WarpShape16x64x128 for L40 GPUs. This config's larger K dimension (128 vs 64) would exceed L40's shared memory capacity in GROUPED_GEMM mode, causing the assertion failure mentioned in the PR. The fix is minimal, well-scoped, and maintains full functionality for newer architectures.
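As a rough, editorial illustration of why that tile can exceed the budget, a back-of-the-envelope estimate follows; the stage count and the assumption that only the A/B operand tiles occupy shared memory are simplifications, not a description of the actual CUTLASS kernel.

```python
# Back-of-the-envelope shared-memory estimate for the removed tile config
# (illustrative only; real CUTLASS staging, padding and epilogue smem differ).
cta_m, cta_n, cta_k = 16, 256, 128   # CtaShape16x256x128
bytes_per_elem = 1                   # FP8 operands
stages = 3                           # assumed pipeline depth

per_stage = (cta_m * cta_k + cta_n * cta_k) * bytes_per_elem  # A tile + B tile
total = per_stage * stages
print(per_stage, total)  # 34816 bytes per stage, 104448 bytes total
# An SM89 part such as the L40 caps a thread block at roughly 99 KB (101376 B)
# of shared memory, so this estimate already overshoots; halving the K depth
# to 64 roughly halves the per-stage footprint and fits comfortably.
```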
flashinfer/jit/core.py (1)

93-96: SM89 flags correctly omit FP4 support.

The implementation appropriately excludes -DFLASHINFER_ENABLE_FP4_E2M1 for SM89 (L40), unlike other architectures that use common_nvcc_flags. This aligns with the PR objective of fixing compilation issues for L40 by removing problematic GEMM tactics.

flashinfer/fused_moe/core.py (2)

40-40: LGTM! Import added correctly to support SM89 module generation.

289-290: LGTM! Backend dispatch for SM89 follows the established pattern and correctly invokes the new module generation function.

flashinfer/jit/fused_moe.py (1)

21-27: LGTM! Import correctly updated to include sm89_nvcc_flags, needed for the new SM89 module generation function.

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

692-694: LGTM: Proper FP4 validation for SM89.

The explicit compile-time check correctly prevents FP4 usage on SM < 90, consistent with the earlier validation for SM75-80 at lines 685-688.
Signed-off-by: Amir Klein <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)
35-36: LGTM: Guard is appropriate for FP4-only SM120 specialization.

The ENABLE_FP4 guard is correct here. Since isValidSM120MOESpecialisation only supports __nv_fp4_e2m1 types (lines 37-40), requiring ENABLE_FP4 at the preprocessor level prevents compilation errors when FP4 support is not available.
Looks good to me, mainly just want to understand why we need to disable the tile shape.
Coderabbit's comment about more granular Fp4 guard might make sense, but I assume that if we are compiling with blackwell support we should also have FP4
CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};
Are there any SM89 GPUs that can support the CtaShape16x256x128_WarpShape16x64x128, or is this an SM120 addition?
Having a small M value like this helps with low latency cases, so I'd want to understand why it's not supported before disabling it
At the very least, can you leave a comment saying what the difference between the two lists is, so people don't have to manually compare the items?
> Are there any SM89 GPUs that can support the CtaShape16x256x128_WarpShape16x64x128, or is this an SM120 addition? Having a small M value like this helps with low latency cases, so I'd want to understand why it's not supported before disabling it
This is a removal from the sm89 path. When I tested it on an L40 GPU I got Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel.
It might be that on other sm89 GPUs it will pass; the main issue is that this was the default tactic chosen when trying to use FusedMoE. I believe that moving it to be the last will also fix my issue.
I tried moving this tile config to be the last one and now the default tactic won't fail on L40. The issue is that if the autotuner is on, the tactics that use this tile config will report an error with a stacktrace, which looks bad.
@yzh119 do you think it'll be ok to change the errors that happen in the autotuner to be debug logs? Otherwise it means users will get spammed with error messages when they run autotuning on L40 FusedMoE.
I think the autotuner should still output warnings, but just make them say "Skipping tactic x due to error. This tactic may not be supported by the current GPU architecture".
That said I know there is a difference of opinion on whether we should proactively filter them as you have done here, the argument being that we should be able to do the due diligence to determine what tactics are supported so that we can raise an error when a tactic fails when it shouldn't. So I can see either side.
I agree. I think that since this PR is somewhat time critical I'd rather move the faulty tile config to be the last so as not to remove it, and change the log to be a warning as you suggested and maybe a debug log with the stacktrace.
I'll also open an issue to take a deeper look into whether this specific tile config is relevant for sm89 or not, as I think it might take me some time.
@djns99 , @yzh119 Does that sound ok to you?
I'm happy with that resolution
There are some compilation errors on CI, @amirkl94 would you mind taking a look?
Signed-off-by: Amir Klein <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
691-699: LGTM: FP8/W4AFP8 routing to SM89 is correct.

The restructured logic correctly routes FP8 and W4AFP8 workloads to the SM89 architecture for the SM 80-89 range, which is essential for L40 support. The runtime check at line 697 appropriately prevents FP8 usage on SM80-88 hardware that lacks FP8 support. The static_assert placement within the ENABLE_FP8 block is consistent with the codebase pattern (see lines 244-247).
Minor: Consider clarifying the error message.
The error message at line 697 mentions "fp8" but the condition includes use_w4afp8. For clarity, consider updating the message to reflect both supported types.

- TLLM_CHECK_WITH_INFO(sm_ == 89, "For sm >= 80 and < 90, fp8 is only supported with sm == 89");
+ TLLM_CHECK_WITH_INFO(sm_ == 89, "For sm >= 80 and < 90, fp8 and w4afp8 are only supported with sm == 89");
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (6): __nv_fp8_e5m2 (91-93), cutlass (114-116), cutlass (120-122), cutlass (127-129), cutlass (132-134), cutlass (140-142)
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h (1): cutlass (40-677)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1): multi_processor_count_ (341-341)
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
700-712: LGTM: Non-FP8 handling and FP4 guard are correct.

The else branch correctly handles non-FP8 cases for SM 80-89:
- FP4 types are appropriately blocked on SM < 90 (line 703)
- Non-FP8 workloads dispatch to SM80 architecture, which is an intentional kernel reuse strategy for SM89 as documented in the kernel implementation
The duplication of the dispatch call (lines 705-706 and 709-710) is necessary for conditional compilation when ENABLE_FP4 is not defined.
Pushed a commit to fix this
Signed-off-by: Amir Klein <[email protected]>
…ses on L40 Signed-off-by: Amir Klein <[email protected]>
Signed-off-by: Amir Klein <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp(1 hunks)csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h(2 hunks)flashinfer/autotuner.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
🧰 Additional context used
🧬 Code graph analysis (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (5): cutlass (114-116), cutlass (120-122), cutlass (127-129), cutlass (132-134), cutlass (140-142)
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h (1): cutlass (40-677)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (2)
flashinfer/autotuner.py (1)
485-492: LGTM! Appropriate logging level adjustment.

The change from error to warning/debug levels is sensible for profiling failures during autotuning. Since failed tactics are automatically skipped (time set to infinity) and the fallback mechanism ensures operation continues, treating these as errors would create unnecessary log noise. The warning level appropriately notifies users that a tactic was skipped, while debug level preserves diagnostic details for troubleshooting.
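For reference, the logging pattern being endorsed looks roughly like the following self-contained sketch; the runner.profile hook and tactic objects are hypothetical, and this is not the actual flashinfer autotuner code.

```python
import logging

logger = logging.getLogger("autotuner")

def profile_tactics(runner, tactics):
    """Pick the fastest tactic; failed tactics are skipped rather than fatal."""
    best, best_time = None, float("inf")
    for tactic in tactics:
        try:
            elapsed = runner.profile(tactic)  # hypothetical profiling hook
        except Exception:
            # Warning tells the user a tactic was skipped (e.g. unsupported on
            # this GPU architecture); debug keeps the stacktrace for diagnosis.
            logger.warning(
                "Skipping tactic %s due to error; it may not be supported by "
                "the current GPU architecture.", tactic)
            logger.debug("Failure details for tactic %s", tactic, exc_info=True)
            elapsed = float("inf")
        if elapsed < best_time:
            best, best_time = tactic, elapsed
    return best
```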
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)
35-36: LGTM! Guard correctly requires ENABLE_FP4.

The guard appropriately requires both CUTLASS_ARCH_MMA_SM120_SUPPORTED and ENABLE_FP4, since this specialization exclusively validates FP4 types (__nv_fp4_e2m1), which are only defined when FP4 support is compiled in.
Signed-off-by: Amir Klein <[email protected]>
LGTM, should be ready to merge once CI passes.

/bot run

LGTM

[SUCCESS] Pipeline #37466918: 13/17 passed
📌 Description
Fixed a few compilation issues for L40, and removed 1 gemm tactic for sm == 89 that crashes due to: Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel.

🧪 Tests

Ran pytest tests/moe/test_trtllm_cutlass_fused_moe.py manually on an L40 GPU and verified all tests passed.
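A programmatic equivalent of that manual run, assuming a CUDA-enabled flashinfer build, an SM89 GPU, and the same test path:

```python
# Run the fused-MoE test file the same way the manual verification did.
import pytest

exit_code = pytest.main(["tests/moe/test_trtllm_cutlass_fused_moe.py", "-v"])
raise SystemExit(exit_code)
```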
Summary by CodeRabbit
- New Features
- Bug Fixes / Compatibility
- Performance
- Chores